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Abstract 

Motivation: Several different threads of research have 
been proposed for modehng and mining temporal data. 
On the one hand, approaches such as dynamic Bayesian 
networks (DBNs) provide a formal probabilistic basis to 
model relationships between time-indexed random vari- 
ables but these models are intractable to learn in the 
general case. On the other, algorithms such as frequent 
episode mining are scalable to large datasets but do not 
exhibit the rigorous probabilistic interpretations that are 
the mainstay of the graphical models literature. 
Results: We present a unification of these two seemingly 
diverse threads of research, by demonstrating how dy- 
namic (discrete) Bayesian networks can be inferred from 
the results of frequent episode mining. This helps bridge 
the modeling emphasis of the former with the counting 
emphasis of the latter. First, we show how, under rea- 
sonable assumptions on data characteristics and on in- 
fluences of random variables, the optimal DBN structure 
can be computed using a greedy, local, algorithm. Next, 
we connect the optimality of the DBN structure with the 
notion of fixed-delay episodes and their counts of distinct 
occurrences. Finally, to demonstrate the practical feasi- 
bility of our approach, we focus on a specific (but broadly 
applicable) class of networks, called excitatory networks, 
and show how the search for the optimal DBN structure 
can be conducted using just information from frequent 
episodes. Application on datasets gathered from mathe- 
matical models of spiking neurons as well as real neuro- 
science datasets are presented. 

Availability: Algorithmic implementations, simulator 
codebases, and datasets are available from our website 
at http : / /neural-code . cs . vt . edu / dbn . 

Keywords: Event sequences, dynamic Bayesian 
networks, temporal probabilistic networks, frequent 
episodes, temporal data mining. 



1 Introduction 

Probabilistic modeling of temporal data is a thriving area 
of research. The development of dynamic Bayesian net- 
works as a subsuming formulation to HMMs, Kalman 



filters, and other such dynamical models has promoted 
a succession of research aimed at capturing probabilis- 
tic dynamical behavior in complex systems. DBNs bring 
to modeling temporal data the key advantage that tra- 
ditional Bayesian networks brought to modeling static 
data, i.e., the ability to use graph theory to capture prob- 
abilistic notions of independence and conditional inde- 
pendence. They are now widely used in bioinformatics, 
neuroscience, and linguistics applications. 

A contrasting line of research in modeling and mining 
temporal data is the counting based literature, exem- 
plified in the KDD community by works such as [9, 7]. 
Similar to frequent itemsets, these papers define the no- 
tion of frequent episodes as objects of interest. Identi- 
fying frequency measures that support efficient counting 
procedures (just as support does for itemsets) has been 
shown to be crucial here. 

It is natural to question whether these two threads, 
with divergent origins, can be related to one another. 
Many researchers have explored precisely this ques- 
tion. The classic paper by Pavlov, Mannila, and 
Smyth [13] used frequent itemsets to place constraints 
on the joint distribution of item random variables and 
thus aid in inference and query approximation. Chao and 
Parthasarathy [16] view probabilistic models as summa- 
rized representations of databases and demonstrate how 
to construct MRF (Markov random field) models from 
frequent itemsets. Closer to the topic of this paper, the 
work by Laxman et al. [7] linked frequent episodes to a 
generative HMM-like model of the underlying data. 

Similar in scope to the above works, we present a unifi- 
cation of the goals of dynamic Bayesian network inference 
with that of frequent episode mining. Our motivations 
are not merely to establish theoretical results but also to 
inform the computational complexity of algorithms and 
spur faster algorithms targeted for specific domains. The 
key contributions are: 

1. We show how, under reasonable assumptions on 
data characteristics and on infiuences of random 
variables, the optimal DBN structure can be estab- 
lished using a greedy, local, approach, and how this 
structure can be computed using the notion of fixed- 
delay episodes and their counts of distinct occur- 
rences. 
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2. We present a specific (but broadly applicable) class 
of networks, called excitatory networks, and show 
how the search for the optimal DBN structure can 
be conducted using just information from frequent 
episodes. 

3. We demonstrate a powerful application of our meth- 
ods on datasets gathered from mathematical mod- 
els of spiking neurons as well as real neuroscience 
datasets. 

2 Bayesian Networks: Static and 
Dynamic 

Formal mathematical notions are presented in the next 
section, but here we wish to provide some background 
context to past research in Bayesian networks (BNs). As 
is well known, BNs use directed acyclic graphs to encode 
probabilistic notions of conditional independence, such 
as that a node is conditionally independent of its non- 
descendants given its parents (for more details, see [5]). 
The earliest known work for learning BNs is the Chow- 
Liu algorithm [2]. It showed that, if we restricted the 
structure of the BN to be a tree, then the optimal BN can 
be computed using a minimum spanning tree algorithm. 
It also established the tractability of BN inference for 
this class of graphs. 

More recent work, by Williamson [17], generalizes the 
Chow-Liu algorithm to show how (discrete) distributions 
can be generally approximated using the same ingredi- 
ents that are used by the Chow-Liu approach, namely 
mutual information quantities between random variables. 
Meila [10] presents an accelerated algorithm that is tar- 
geted toward sparse datasets of high dimensionality. The 
approximation thread for general BN inference is perhaps 
best exemplified by Friedman's sparse candidate algo- 
rithm [3] that presents various approaches to greedily 
learn (suboptimal) BNs. 

Dynamic Bayesian networks are a relatively newer de- 
velopment and best examples of them can be found in 
specific state space and dynamic modeling contexts, such 
as HMMs. In contrast to their static coimterparts, exact 
and efficient inference for general classes of DBNs has 
not been well studied. 

3 Optimal DBN structure 

Consider a finite alphabet, £ = {Ai,. . . ,Am}, 
of event- types (or symbols). Let s = 

{{Ei,ti), {E2,t2), {En,tn)) dcnotc a data stream of 
n events over £. Each Ei,i = 1, ... ,n, is & symbol from 
£. Each ti,i = l,...,n, takes values from the set of 
positive integers. The events in s are ordered according 
to their times of occurrence, ti+i > ti,i = 1, . . . , {n — 1). 
The time of occurrence of the last event in s, is 



denoted by i„ = T. We model the data stream, s, 
as a realization of a discrete-time stochastic process 
X(i),t = l,...,r; X{t) = [Xiit)X2it)---XM{t)]', 
where Xj(t) is an indicator variable for the occurrence 
of event type, Aj G £, at time t. Thus, for j = 1, . . . ,M 
and t — 1, . . . , T, we will have Xj(t) = 1 if {Aj,t) G s, 
and Xj{t) = otherwise. Each Xj{t) is referred to as 
the event-indicator random variable for event-type, Aj, 
at time t. 

Example 1 The following is an example event sequence 
of n = 7 events over an alphabet, £ = {A, B,C,. . . , Z}, 
of M = 26 event-types: 

{{A, 2), {B, 3), {D, 3), {B, 5), {C, 9), {A, 10), {D, 12)) (1) 

The maximum time tick is given by T = 12. Each 
X(t), t = 1, . . . , 12, is a vector of M = 2Q indicator ran- 
dom variables. Since there are no events at time t = 
in the example sequence (1), we have X(l) = 0. At 
time t = 2, we will have X(2) = [1000 • • -0]'. Similarly, 
X(3) = [0101 • • -0]', and so on. 

A Dynamic Bayesian Network [11] is essentially a DAG 
with nodes representing random variables and arcs repre- 
senting conditional dependency relationships. In this pa- 
per, we model the random process X(<) (or equivalently, 
the event stream s), as the output of a Dynamic Bayesian 
Network. Each event-indicator, Xj{t), t = 1, . . . ,T and 
j = 1, . . . M, corresponds to a node in the network, and 
is assigned a set of parents, which is denoted as Tr{X j{t)) 
(or TTjlt)). A parent-child relationship is represented by 
an arc (from parent to child) in the DAG. In a Bayesian 
Network, nodes are conditionally independent of their 
non-descendants given their parents. The joint probabil- 
ity distribution of the random process, X(i), under the 
DBN model, can be factorized as a product of conditional 
probabilities, P[Xj{t) \ 'iTj{t)], for various j,t. 

In general, given a node, Xj{t), any other X/j(t) can 
belong to its parent set, '!Tj{t). However, since each node 
has a time-stamp, it is reasonable to assume that a ran- 
dom variable, ^^(t), can only infiuence future random 
variables (i.e. those random variables associated with 
later time indices). Also, we can expect the infiuence 
of Xk{T) to diminish with time, and so we assume that 
Xfc(T) can be a parent of Xj{t) only if time t is within 
W time-ticks of time r (Typically, W will be small, like 
say, 5 to 10 time units). All of this constitutes our first 
constraint, Al, on the DBN structure. 

Al : For user-defined parameter, > 0, the set, Trj{t), 
of parents for the node, Xj{t), is a subset of event- 
indicators out of the W^-length history at time-tick, 
t, i.e. TTjit) C {Xfe(r) : l<k<M, {t-W) <t < 
t}. 

The DBN essentially models the time-evolution of the 
event-indicator random variables associated with the M 
event-types in the alphabet, £. By learning the DBN 
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structure, we expect to unearth relationships Hke "event 
B is more likely to occur at time i, if A occurred 3 time- 
ticks before t and if C occurred 5 time-ticks before t." In 
view of this, it is reasonable to assume that the parent- 
child relationships depend on relative (rather than ab- 
solute) time-stamps of random variables in the network. 
We refer to this as translation invariance of the DBN 
structure, and is specified below as the second constraint, 
A2, on the DBN structure. 

A2 : If TTj (t) — {Xj-^ (ti), . . . , Xj^ {te)} is an £-aize parent 
set of Xj{t) for some t > W, then for any other 
Xj{t'), t' > W, its parent set, TTj{t'), is simply a 
time-shifted version of ttj (t), and is given by tTj {t') = 
{Xj^ {ti+5),..., Xj, {te + S)}, where 5={t'-t). 

The data stream, s, is a long stream of events which 
we will regard as a realization of the stochastic process, 
X(i), t ^ 1, . . . ,r. While A2 is a sort of structural 
stationarity constraint on the DBN, in order to estimate 
marginals of the joint distribution from the data stream, 
we will also require that the distribution does not change 
when shifted in time. The stationarity assumption is 
stated in A3 below. 

A3 : For all j, S, given any set oil event-indicators, say, 
{Xj^{ti), . . . ,Xj^{te)}, the stationarity assumption 
requires that, P[Xj^{ti), . . .,Xj^{te)] = P[Xj^{ti + 
5), ...,X,,ite + 6)]. 

The joint probability distribution, Q[-], under the Dy- 
namic Bayesian Network model can be written as: 

Q[X(l),...,X(r)] -P[X(1),...,X(VF)] 

X ri I[PlXj{t)\7rjm (2) 

t=W+lj=l 

Learning the structure of the network involves learn- 
ing the map, Trj{t), for each Xj{t), j = 1,...,M and 
t = {W + 1), . . . ,T. In this paper, we derive an opti- 
mal structure for a Dynamic Bayesian Network, given 
an event stream, s, under assumptions Al, A2 and A3. 
Our approach follows the lines of [2, 17] where struc- 
ture learning is posed as a problem of approximating 
the discrete probability distribution, P[-], by the best 
possible distribution from a chosen model class (which, 
in our case, is the class of Dynamic Bayesian Networks 
constrained by Al and A2). The Kullback-Leibler di- 
vergence between the underlying joint distribution, P[-], 
of the stochastic process, and the joint distribution, Q[-], 
under the DBN model is given by 

Dkl{p\\Q) = Y^(p[M^),...,Mt)] 

A ^ 

P[X(1),...,X(T)]\ 



X log 



Q[X(1),...,X(T)]; 



(3) 



where A represents the set of all possible assignments 
for the T M-length random vectors, {X(l), . . . ,X(T)}. 



Denoting the entropy of P[X(1), . . . , X(T)] by H{P), the 
entropy of the marginal, P[X(1), . . . , X{W)], by H{Pw), 
and substituting for Q[-] from Eq. (2), we get 

Dkl{P\\Q) = -H{P) - H{Pw) - J2 (PW), ■ ■ • ,X(T)] 

A ^ 

M T V 

x^ ^ \ogP[Xi{t)\^M) (4) 

We now expand the conditional probabilities in Eq. (4) 
using Bayes rule, switch the order of summation and 
marginalize P\\ for each j, t. Denoting, for each j,t, 
the entropy of the marginal P\Xj{t^ by H{Pj^t), the ex- 
pression for KL divergence now becomes: 

M T 

Dkl{P\\Q) = -H{P)-H{Pw)-J2 E ^(^:'-.*) 

j=l t=W+l 

M T 

-EE i[xAt)\^5m (5) 

j=l t=W+l 

I[Xj{t) ; TTj{ty\ denotes the mutual information between 
Xj{t) and its parents, '!Tj{t), and is given by 

7[X,-(i);7r,(t)]=^(p[X,(t),7r,-(i)] 

Jlog^iMW^lL) (6) 

^P[X,{t)]P[7r,{t)]J ^ > 

where Aj.t represents the set of all possible assign- 
ments for the random variables, {Xj{t),Trj{t)}. Under 
the translation invariance constraint, A2, and the sta- 
tionarity assumption, A3, we have I[Xj(t) ; '!Tj{t)] = 
I[Xj{t') ; TTj{t')] for aU t > W, t' > W. This gives us the 
following final expression for -Difz,(P| IQ): 

M T 

Dkl{P\\Q)=-H{P)-H{Pw)-J2 E ^(^:'-.*) 

j=l t=w+i 

M 

-(T-iy)^/[X,(i);7r,(t)] (7) 

where t is any time-tick satisfying {W < t < T). We 
note that in Eq. (7), the entropies, H{P), H{Pw) and 
H{Pj^t) are independent of the DBN structure (i.e. they 
do not depend on the TTj{t) maps). Since (T — W) > 
and since I[Xj{t) ; nj{t)] > always, the KL divergence, 
Dkl{P\\Q), is minimized when the sum of M mutual 
information terms in Eq. (7) is maximized. Further, 
from Al wc know that all parent nodes of Xj{t) have 
time-stamps strictly less than t, and hence, no choice 
of TTj {t), j = 1, . . . ,M can result in a cycle in the net- 
work (in which case, the structure will not be a DAG, 
and in-turn, it will not represent a valid DBN). This 
ensures that, under the restriction of Al, the optimal 
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DBN structure (namely, one that corresponds to a 
that minimizes KL divergence with respect to the true 
joint probabihty distribution, for the data) can be 
obtained by independently picking the highest mutual in- 
formation parents, TTj{t), for each Xj{t) for j = 1, . . . , M 
(and, because of A2 and A3, wc need to carry-out this 
parents' selection step only for the M nodes in any one 
time slice, t, that satisfies {W <t<T)). 

4 Fixed-delay episodes 

Frequent episode discovery is a popular framework in 
temporal data mining [8, 12, 1]. The data in this frame- 
work is a single long stream of events over a finite alpha- 
bet, as defined at the beginning of Sec. 3 (cf. Exam,ple 1). 
In the formalism of [9], an £-node (serial) episode, a, is 
defined as a tuple, (V^, <a,ga)i where Va = {v\, . . . ,V(i} 
denotes a collection of nodes, <a denotes a total or- 
der^ such that Vi t'i-i-i, i = 1, . . . , (i? — 1). If 
9a{vj) = = !,...,£, we use the graphical nota- 

tion (Aj-^ ■ ■ ■ ^ Aj^) to represent a. An occurrence of 
a in event stream, s = {{Ei,ti),{E2,t2), ■ ■ ■ ,{En,tn)), 
is a map h : — > {1, . . . , n} such that (i) E^y.^ = 
g(vj) \fvj G Va, and (ii) for all Vi <a Vj in Vq, the times 
of occurrence of the i^^ and j^^ event in the occurrence 
are related according to < th(vj) in s. 

Example 2 Consider a 3-node episode a — {Va,<a 

,ga), such that, Va = {vi,V2,V3}, Vi V2, V2 <a V3 

and Vi <a f3, a,nd ga{vi) = A, ga{v2) = B and 
9a{v3) = C. The graphical representation for this episode 
is a = {A —> B ^ C), indicating that in every oc- 
currence of a, an event of type A must appear before 
an event of type B, and the B must appear before an 
event of type C. For example, in sequence (1), the sub- 
sequence ((A, 1), (i?, 3), (C, 9)) constitutes an occurrence 
of {A ^ B ^ C). For this occurrence, the corresponding 
h-map is given by, h{vi) = 1, h{v2) = 2 and h{v3) = 5. 

In the episode formalism reviewed so far, the exact 
time-stamps on the events in the data stream are only 
used to check the time-order of events constituting an 
episode occurrence. There are many ways to incorporate 
explicit time-duration constraints in episode occurrences 
(like the windows- width constraint of [9] , or the dwelling 
time constraint of [8]). Episodes with inter-event gap 
constraints were introduced in [12]. For example, the 
framework of [12] can express the temporal pattern "B 
must follow A within 5 time-ticks and C must follow B 
within 10 time-ticks." Such a pattern is represented us- 
ing the graphical notation, (A ^^^-^ B C). In this 
paper, we use a simple sub case of the inter-event gap 
constraints, in the form of fixed inter-event time-delays. 

^In general, <« can be any partial order over Va- In this paper, 
we only consider the case of episodes with a total order over Va. 
In [9], these are referred to as serial episodes. 



Here, each intcr-evcnt time constraint is represented by 
a single delay rather than a range of delays. Wc will re- 
fer to such episodes as fixed-delay episodes. For example, 
(^ — > B — > C) represents a fixed-delay episode, every 
occurrence of which must comprise an A, followed by a 
B exactly 5 time-ticks later, which in-turn is followed by 
a C exactly 10 time-ticks later. 

Definition 4.1 An £-node fixed-delay episode is defined 
as a pair, (a.V), where a = (Va, <a, ga) is the usual 
(serial) episode of [9], and V = {Si, . . . , Si-i) is a se- 
quence of {£ — 1) non-negative delays. Every occur- 
rence, h, of the fixed-delay episode in an event sequence, 
s = {{Ej^ ,ti),. . . , {Ej^ , tn)), must satisfy the inter-event 
constraints, Si = (th(vi+i) -th(vi)), i = !)• 

(j4jj — ^- • ■ • Aji, ) is the graphical notation for inter- 
event episode, (a, 2?), where Aj. = ga{vi), i = 



The framework of frequent episode discovery requires the 
notion of episode frequency. This can be defined in many 
ways [9,6]. In this paper, we use the notion of distinct oc- 
currences to define the frequency of fixed-delay episodes. 

Definition 4.2 Two occurrences, hi and h2, of a fixed- 
delay episode, {a,V), are said to be distinct, if they do 
not share any events in the data stream, s. Given a 
user-defined, W > 0, frequency of (a, V) in s, denoted 
fs{oi, V, W), is defined as the total number of distinct oc- 
currences of (a, V) in s that terminate strictly after W. 



In general, counting distinct occurrences of episodes suf- 
fers from computational inefficiencies [6]. (Each occur- 
rence of an episode {A ^ B ^ C) \s a substring that 
looks like A * B * C, where * denotes a variable-length 
don't-care, and hence, coimting all distinct occurrences 
in the data stream can require an unbounded number of 
automata for each episode). However, in case of fixed- 
delay episodes, it is easy to track distinct occurrences 
efficiently. For example, when counting frequency of 

3 5 

{A — > B — > C), if we encounter an A at time t, to 
recognize an occurrence involving this A we only need 
to check for a i? at time (t 3) and for a C at time 
(t-l-8). In addition to being attractive from an efficiency 
point-of-vicw, wc will show next in Sec. 4.1 that the dis- 
tinct occurrences-based frequency count for fixed-delay 
episodes will allow us to interpret relative frequencies as 
probabilities of DBN marginals. (Note that the W in 
Definition 4-2 is same as length of the history window 
used in the constraint Al. Skipping any occurrences 
terminating in the first W time-ticks makes it easy to 
normalize the frequency count into a probability mea- 
sure). 
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4.1 Mcirginals from episode frequencies 

In Sec. 3 we derived the optimal DBN structure for an 

event sequence under constraints Al and A2, and as- 
sumption A3. The main result was that for t > W, the 
(optimal) parents of a node, Xj{t), corresponds to the 
subset, -Kjit) C {Xkir) : 1 < k < M; {t - W) < t < t}, 
which maximizes the mutual information, I[Xj(t); TTjit)]. 
In this section, we describe how to compute this mu- 
tual information from the frequency counts of fixed-delay 
episodes. 

Every subset of event-indicators in the network is as- 
sociated with a fixed-delay episode. 

Definition 4.3 Let {X^{t) : j = 1, . . . , M; i = 
l,...,r} denote the collection of event-indicators used 
to model event stream, s = ((£^i,fi), . . . (S„,t„)), over 
alphabet, £ = {Ai. ...,Am}- Consider an £-size sub- 
set, X = {Xjj^{ti), . . . , Xjg{ti)}, of these indicators, 
and without loss of generality, assume ti < ■ ■ ■ < te. 
Define the (£ — 1) inter-event delays in X as follows: 
Sj = {tj^i — tj), j = 1, ...,{£ — 1). The fixed-delay 
episode, {a{X),T>{X)), that is associated with the sub- 
set, X , of event-indicators is defined by a{X) = (Aj^ 
■ ■ ■ ^ Ajg), and V(X) = {6i, . . . , 6e-i}. In graphical no- 
tation, the fixed-delay episode associated with X can be 
represented as follows: 

{a{X),ViX)) = {Aj,^...''-^'Ai,). (8) 



For computing mutual information using Eq. (6), 
we need the marginals of various subsets of event- 
indicators in the network. Given a subset like X = 
{Xj^{ti), . . . , Xj^{te)}, we need estimates for probabil- 
ities of the form, P[Xi^{ti) = ai,..., Xi^{tt) = at], 
where aj £ {0, 1}, j = 1, . . . The fixed-delay episode, 
{a{X),V{X)), that is associated with X is given by 
Definition 4-3 and its frequency in the data stream, 
s, is denoted by fs{a{X),T>{X),W) (as per Defini- 
tion 4-2) where W denotes length of history window as 
per Al. Since an occurrence of the fixed-delay episode, 
{a{X),V{X)), can terminate in each of the (T—W) time- 
ticks in s, the probability of an all-ones assignment for 
the random variables in X is given by: 

(9) 

For all other assignments (i.e. for assignments that are 
not all ones) we need to use the inclusion-exclusion 
formula to obtain corresponding probabilities. The 
inclusion-exclusion principle has been used before in data 
mining, e.g. for approximating queries using frequent 
itemsets [14]. The idea is to obtain exact or approxi- 
mate frequency counts for arbitrary boolean queries us- 
ing only counts of frequent itemsets in the data. In 



our case, counting distinct occurrences of fixed-delay 
episodes facilitates use of the inclusion-exclusion for- 
mula for obtaining the probabilities needed in the mu- 
tual information expression of Eq. (6). Consider the set, 
X = {Xj-^ [ti], . . . , Xj^ {t( )}: of £ event-indicators, and let 
(ai,...,a£), Oj G {0,1}, j = !,...,£, be any general 
assignment for the event-indicators in X. Let U C X de- 
note the set of indicators out of X for which correspond- 
ing assignments in A are all I's, i. e. U = {Xjj. G X : 
k s.t. afc = 1 in ^, 1 < /c < £}. The inclusion-exclusion 
formula can now be used to compute the probabilities as 
follows: 

P[Xj^ =ai,...,Xj^ = ai] 

3^ s.t. 

ucycx 

where we use fs{y) as short-hand for fs(a.{y), P(3^), W), 
the distinct occurrences-based frequency (cf. Defini- 
tion 4-2) of the fixed-delay episode, {a{y),T>{y)). It 
is possible to verify that summing the expression in 
Eq. (10) over all possible binary i?-tuples, {ai, . . . ,ae), 
always yields 1. Thus, in case of fixed-delay episodes, 
suitably-normalized frequency counts can be regarded as 
corresponding marginal probabilities. 

5 Excitatory networks 

In this section, we describe a restricted class of networks, 
called excitatory networks where only certain kinds of 
conditional dependencies among nodes arc pcrimitted. 
In general, each event-type from the alphabet can have 
some propensity of random occurrence (and this of 
course, can be different for different event-types). In so- 
called excitatory networks the only way to increase the 
propensity of occurrence of an event-type is by the oc- 
currence of specific event-types at specific delays in the 
immediate past. For example, we can have a conditional 
dependency such as, "whenever A, B and C occur (say) 
2 time-ticks apart, the probability of occurrence of D in- 
creases." No other kinds of conditional dependencies are 
permitted in excitatory networks. In other words, one 
or more event-types cannot increase the propensity of 
another by not occurring in the data stream. This kind 
of constraint appears naturally in neuroscience, where 
one is interested in unearthing conditional dependency 
relationships among neuron spiking patterns. There are 
several regions in the brain that are known to exhibit pre- 
dominantly excitatory characteristics and in these cases 
a neuron cannot increase the firing rate of another by not 
firing. 

In the context of DBN structure learning, this re- 
striction translates to event-types frequently appearing 
alongside their respective parents (with specific delays). 
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This motivates the use of frequent episodes to restrict the 
search space for parent sets. We note that, in general, it 
is difficult to infer an optimal network by inspection of 
the set of frequent episodes and/or the association rules 
that can be derived from it. This is because, there can be 
several frequent episodes/rules ending in the same event- 
type and these may even be of different sizes. There 
is no obvious way to systematically analyze frequencies 
and confidence values among these candidates to pick 
the optimal parent (s) in the network. In Sec. 7, we will 
present some examples to illustrate the shortcomings of 
our frequency based approach in detecting higher order 
causative relationships. 

6 Algorithms 

In Sees. 3-4, we developed the formalism for learning 

an optimal DBN structure from event streams by using 
distinct occurrences-based counts of fixed-delay episodes 
to compute the DBN marginal probabilities. The top- 
level algorithm for discovering the network is to fix any 
time t > W and consider each Xi{t), i = 1, . . . , M, in- 
turn. The best parent set, 7Ti{t), for Xi{t) is selected 
by searching over subsets of event-indicators (in the W- 
length history of Xi{t)) for the one that maximizes the 
mutual information, I[Xi{t) ; 7ri(f)]. To compute this 
mutual information we use the formalism of fixed-delay 
episodes. The marginal probabilities required in the com- 
putation of I[Xi{t) ; TTi{t)] are obtained using Eq. (10) 
which basically uses the frequencies of appropriate fixed- 
delay episodes in an inclusion-exclusion formula. Finally, 
since we are looking for only excitatory connections, we 
restrict our search space to frequent fixed-delay episodes. 
Thus the first step in our DBN learning algorithm is to 
detect all frequent fixed-delay episodes with a span less 
than W. This is described in Sec. 6.1. These frequent 
episodes (along with their respective frequencies) are in- 
put to the actual parents-search algorithm (which con- 
structs different parents and picks the one with highest 
mutual information). This is explained next in Sec. 6.2 

6.1 Discovering fixed-delay episodes 

However for long patterns and low support thresholds, an 
Apriori like algorithm incurs substantial costs and fur- 
ther must repeatedly scan the entire database for each 
level. In Algorithm 1 we present a pattern-growth based 
algorithm for mining frequent episodes (with fixed de- 
lays). 

The pattern-growth procedure given in Algorithm 1 

takes as input an episode a, a set of integers Va, and 
the actual event sequence <S. T>a is a set time stamps 
ti such that there is an occurrence of a starting at ti in 
S. For example, say at level 1 we have a = (C), then 
2^(C) = {1)4,5,8,9} for the event sequence S shown in 
Fig 1. The algorithm proceeds by obtaining counts for 



Procedure 1 jxillc ni._(jruir((\..Vi^..S) 

Input: A''-node episode a = {Ej^ ...£^(jv)) and 

event sequence S = {{Ei,ti)},i G {l...n}, Length 
of history window W. 

1: A = W — span{a) 

2: for all ^ e 5 do 

3: for (5 = to A do 

4: if (5 = and a[l] > A then 

5: continue 

6: p = A^a 

7: {Obtain count of /3 in projected data sequence 
8: for all i eT>a do 

9: iEi,U)=S\i] 

10: if 3j such that Ej = A and ti — tj = 6 then 

11: Increment p. count 

12: V^=V0U {j} 

13: if (i.count > then 

14: Add /3 to J^(fc -M) 

15: if |/3| < fc -I- 1 then 

16: pattern-qrowiB, Vg , <S) 



A _L L L ! ! 

B >l. X I 

C I ' ■■J - J '•--♦I I 

1234 56789 10 

Figure 1: An event sequence showing 3 distinct occur- 
rences of the episode A ^ B ^ C . 

all episodes [3 generated by extending a e.g. B ^ C , 

5 2 

. . ., A C etc. For an episode say (3 = B ^ C, the 
count is obtained looking for an occurrence of event B 

at time tj = ti — 2 where ti E Vb- In the example 
tj={2,3,6}. The number of such occurrences gives the 

2 

count of B C. At every step the algorithm tries to 
grow an episode with count > 6 otherwise stops. 

6.2 Learning DBN structure 

In Sec. 3 we derived the conditions of an optimal DBN 

structure. Based on our assumptions, it suffices to look 
at a one time instance t > W and ascertain the parents of 
each of the M nodes X(t). Translation invariance then 
automatically allows us to fix the parents of all other 
X(i)'s at time instance {t > W). 

The algorithm considers one node corresponding to 
each event-type in the alphabet in-turn, and searches for 
its best parent set. Based on our discussion in Sec. 5, we 
restrict the search space to only frequent episodes (end- 
ing in the event- type). If there are no frequent episodes 
ending in an event-type, we declare it as a root node. In 
addition from mutual information theory we know that 



for (5 = to A do 

if <5 = and a[l] > A then 
continue 

p = A^a 

{Obtain count of /3 in projected data sequence 

for all i G Va do 

iEi,ti)=S\i] 

if 3j such that Ej = A and ti — tj = 6 then 
Increment p.count 
Vp=V0U {j} 
if (i.count > then 
Add /3 to J^ik + 1) 
if |/3| < fc -I- 1 then 
pattern-grow{P, V/) , <S) 
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in general 



7 Experimental Results 



i[x-y]>i[X;Z\,yzciy 



(11) 



where X is a single random variable and y and Z are sets 
of random variables. But on removing the nodes that do 
not contribute in causing Xj from the parent set tt^- , the 
mutual information decreases only slightly. This is the 
basis of Algorithm 2 where at each step the parent set 
of size m, 7r™(t) is chosen from the frequent episode of 
size (m + 1) which gives the highest mutual information 
with Xj{t). In the case where a parent set of size n, 
7r"(f) with n > m has already been selected for Xj{t) 
in an earher iteration, the set 7r™(i) replaces 7r"(t) if 
I[X,{t),nJ'{t)] - I[Xj{t),nJ'{t)] < e and TT^{t) C 7r^"(t). 

In addition nj" replaces if I[Xj,ir]'] > I[Xj,-Kj'] 
for TT™ ^ tt". Therefore using the e criteria we can it- 
eratively remove nodes from the parent set that do not 
contribute in the information theoretic sense towards the 
cause of Xj. 



Algorithm 2 DBN learning from frequent episodes 



/* Initialize */ 



h= {} /* Empty hash-map */ 
for i = k down to 1 do 
for all a € Fi+i do 

/* Fi+i: frequent episodes of size i + 1 */ 
6: A = Last event of a 
7: par = prefix{a) 

/* par: first [a — 1| nodes chosen as candidate 
parents of A with delays corresponding to inter- 
event gaps in a */ 
9; if A ^ h then 

h{A) = {par, MI {A, par), i) 
else 

{parpj-Qy , miprev ? 

level) = h{A) 
if level = i + 1 then 
mi = MI (A, par) 
if I mi — mij 

h{A) = {par, mi, i) 
else if level = i then 
mi = MI {A, par) 



10 
11 

12 
13 
14 
15 
16 
17; 
18 
19 
20 
21 



if mi > miprev then 



h{A) = {par, mi, i) 
Output: DBN = {{A, h[A].par),yA} gives the DBN 



The time complexity of computing mutual information 
with a parent set of size k is 0(2'°) as we have to compute 
2*^ value assignments to all the nodes in the parent set 
(which take values or 1). But since A; is a user-supplied 
parameter (and assumed constant for a given run of Al- 
gorithm 2), the time complexity is output-sensitive with 
linear dependence on the number of frequent episodes 
0(|i^j+i|)at each level i. 



We present results on data gathered from both mathe- 
matical models of spiking neurons as well as real neuro- 
science datasets. 

7.1 Data generation model 

Our approach here is to model each neuron as an inhomo- 

geneous Poisson process^ whose firing rate is a complex 
function of the input received by the neuron: 



Xi{t) = 



X 



l + exp{-Ii{t)+0) 



(12) 



Eq 12 gives the firing rate of the i*'' neuron at time t. The 
network inter-connect allowed by this model gives it the 
amount of sophistication required for simulating higher- 
order interactions. More importantly, the model allows 
for variable delays which mimic the delays in conduction 
pathways of real neurons. 



^/3yYj-(t-n,)- 
j 



(13) 

In Eq 13, ^(t-Tij ) is the indicator of the event of a spike 
on j*'' neuron Ty time earlier. The higher order terms in 
the input contribute to the firing rate only when the i*^ 
neuron received inputs from all the neurons in the term 
with corresponding delays. With suitable choice of pa- 
rameters /?(.) one can simulate a wide range of networks. 



7.2 Types of Networks 

In this section we demonstrate the effectiveness of our ap- 
proach in unearthing different types of networks. Each 
of these networks was simulated by setting up the appro- 
priate inter-connections, of suitable order, in our mathe- 
matical model. 

Causative Chains: These are simple first order inter- 
actions forming linear chains. The parent set for each 
random variable is a single variable. Observe that this 
class includes loops in the underlying graph that would 
be 'illegal' in a static Bayesian network formulation. A 
causative chain is perhaps the easiest scenario for DBN 
inference. Here a network with 50 nodes is simulated 
for 60 sec on the multi-neuronal simulator, where the 
conditional probability is varied form 0.4 to 0.8. For a 
reasonably high conditional probability (0.8), we obtain 
100% precision for a reasonablly wide range of the fre- 
quency threshold ([0.002,0.038]). The recall is also sim- 
ilarly high but drops a bit toward higher values of the 
frequency threshold. For the low conditional probability 
scenario, the number of frequent episodes mined drops 
to zero and hence no network is found (implying both 
a precision and recall of 0). A similar experiment was 



^simulator courtesy Mr. Raajay, M.S. Student, IISc, Bangalore. 
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conducted for different values of parameter e and k. For 
this particular network, the results are robust as there 
are only first order interactions. 

Higher-order causative chains: A higher-order chain is 
one where parent sets are not restricted to be of cardi- 
nality one. In the example network of Fig. 2(b) we have 
two disconnected components: one first order causative 
chain formed by nodes A,B,C and D and a higher-order 
causative chain formed by M,N,0 and P. In the higher or- 
der chain, O fires when both M and N fire with appropri- 
ate timing and P fires when all of M, N, and O fire. Look- 
ing only at frequent episodes, both A ^ B ^ C ^ D 
and M ^ N ^ O ^ P turn out to be frequent. But 
using the e criteria for our algorithm we can distinguish 
the two components of the circuits to be of different or- 
ders. Table 1 demonstrates results on this network as e 
is varied from 0.00001 to 0.01, and for the same range of 
conditional probabilities as before. A low e results in a 
decrease of precision, e.g., our approach finds A and B 
to be parents of C. Conversely, for higher values of e our 
algorithm might reject the set M, N, P as parents of P 
and retain some subset of them. A final observation is in 
reference to the times from Table 1; the values presented 
here includes the time to compute the mutual informa- 
tion terms plus the time to mine frequent patterns, and 
the significant component is the latter. 




(c) Syn-fire Chains (d) Polychronous 
Circuits 



Figure 2: Four classes of DBNs investigated in our ex- 
periments. 

Syn-fire Chains: Another important pattern often seen 



Table 1: DBN results for network shown in Fig. 2(b) for 
varying conditional probability (used in generation) and 

e (used in mining); base firing rate = 20Hz. 





JIjJ. Jolit^Jli 




IVV A flii 


Prop 


])r()l) 




(total) 




ision 


0.8 


0.00001 


18.31 


100 


75 


0.8 


0.0001 


18.31 


100 


100 


0.8 


0.001 


18.3 


88.89 


100 


0.8 


0.01 


18.31 


77.78 


100 


0.4 


0.00001 


15.02 


100 


81.82 


0.4 


0.0001 


14.98 


100 


100 


0.4 


0.001 


14.98 


88.89 


100 


0.4 


0.01 


14.98 


66.67 


100 



in neuronal spike train data is that of synfire chains. This 
consists of groups of synchronously firing neurons strung 
together repeating over time. In an earlier work [12], 
it was noted that discovering such patterns required a 
combination of serial and parallel episode mining. But 
the DBN approach applies more naturally to mining such 
network structures. 

Polychronous Circuits: Groups of neurons that fire in 
a time-locked manner with respect to each other are refer 
to as polychronous groups. This notion was introduced 
in [4] and gives rise to an important class of patterns. 
Once again, our DBN formulation is a natural fit for dis- 
covering such groups from spike train data. An example 
of a polychronous circuit is show in Fig 2(d) and its cor- 
responding results in Table 2. 



Table 2: DBN results for network shown in Fig. 2(d) for 
different Freq. thresh, and epsilon 



Cond. 


Freq. 


Epsilon 


Time 


Recall 


Prec- 


prob. 


Thresh 




(total) 




ision 


0.8 


0.002 


0.0005 


22.3 


100 


100 


0.8 


0.014 


0.0005 


18.8 


40 


100 


0.8 


0.002 


0.00001 


22.81 


100 


53.57 


0.8 


0.002 


0.01 


22.91 


93.33 


100 


0.4 


0.002 


0.0005 


15.48 


53.33 


100 


0.4 


0.014 


0.0005 


15.11 


13.33 


100 


0.4 


0.002 


0.00001 


15.28 


53.33 


61.54 


0.4 


0.002 


0.01 


15.3 


46.67 


100 



7.3 Scalability 

The scalability of our approach with respect to data 
length and number of variables is shown in Fig 3 and 
Fig 4. Here four different networks with 50, 75, 100 and 
125 variables respectively were simiilatcd for time dura- 
tions ranging from 20 sec to 120 sec. The base firing 
rate of all the networks was fixed at 20 Hz. In each net- 
work 40% of the nodes were chosen to have upto 3 three 
parents. The parameters of the DBN mining algorithm 
were choosen such that recall and precision are both high 
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50-NodeE 
75-NodeE 

125-NndM 



213 40 eo SO 100 120 
Duralion of event sequence (in sec] 



Figure 3: Plot of time taken for mining frequent episodes 
vs. data length in sec 



(> 80%). It can be seen in the figures that for a network 
with 125 variables, the total run-time is of the order of 
few minutes along with recall > 80% and precision at 
almost 100%. 

Another way to study scalability is w.r.t. the density 
of the network, defined as the ratio of the number of 
nodes that are descendants for some other node to the 
total number of nodes in the network. Fig 5 shows the 
time taken for mining DBN when the density is varied 
from 0.1 to 0.6. 



I 



50-NodeB 
75-Nodes 
1 O0-Nnd9! 
125-NndB! 



Dsnsity 



Figure 5: Plot of total time taken for DBN discovery vs. 
network density 



50-NodeE 
75-NodeE 
lOO-Nnds. 
125-NndM 



1 1 1 1 r 

20 40 eo so 100 120 
Duralion of event sequence (in sec] 



Figure 4: Plot of time taken for mutual information com- 
putation and searching for the parent set vs. data length 
in sec 



7.4 Sensitivity 

Finally, we discuss the sensitivity of the DBN mining 
algorithm to the parameters (0, e). To obtain precision- 
recall curves for our algorithm applied to data sequences 
with different characteristics, we vary the two parame- 
ters 9 and e in the ranges {0.002, 0.008, 0.014, 0.026, 
0.038} and {0.00001, 0.0001, 0.001, 0.01} respectively. 
The data sequence for this experiment is generated from 
the multi-neuronal simulator using different settings of 
base firing rate, conditional probability, number of nodes 
in the network, and the density of the network as defined 
earlier. 

The set of precision-recall curves are shown in Fig 6. 
It can be seen that the proposed algorithm is effective 
for a wide range of parameter settings and also on data 
with sufficiently varying characteristics. 

7.5 Mining DBNs from MEA recordings 

Multi-electrode arrays (see Fig. 7) are high throughput 
ways to record the spiking activity in neuronal tissue 
and are hence rich sources of event data where events 
correspond to specific neurons (or clumps of neurons) 
being activated. We use data from dissociated cortical 
cultures gathered by Steve Potter's laboratory at Georgia 
Tech [15] which gathered data over several days. The 
DBN shown in Fig. 8 depicts a circuit discovered from 
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For varying Freq. Hi . 



For varying FrBq. Hi. 



For varying Frsq. Th. 



For varying Freq. Hi . 





R=C3ll 

For varying opEiion 



Recall 

Forvarying epEiion 



Forvarying epsiion 



Recall 

Forvarying opEiion 




Figure 6: Precision-recall curves for different parameter values in the DBN mining algorithm. 



the first 15 min of recording on day 35 of culture 2-1. The 
overall mining process takes about 10 min with threshold 
e = 0.0015 with DBN search parameter e = 0.0005. 




Figure 7: Micro electrode array (MEA) used to record 
spiking activity of neurons in tissue cultures. 



02 




Figure 8: DBN structure discovered from neuronal spike 
train data. 

In order to establish that this network is in fact sig- 
nificant we run our algorithm on several surrogate spike 
trains generated by replacing the neuron labels of each 
spikes in the real data with a randomly chosen neuron 
label. These surrogates are expected to break the tempo- 
ral correlations in the data and yet preserve the overall 
summary statistics. No network structure was found in 
25 such surrogate sequences. We are currently in the pro- 



cesses of characterizing and interpreting the usefulness of 
such networks found in real data. 

8 Discussion 

We have presented the beginnings of research to relate 
inference of DBNs with frequent episode mining. The 
key contribution here is to show how, under certain as- 
sumptions on network structure, data and distributional 
characteristics, we are able to infer the structure of DBNs 
using the results from frequent episode mining. While 
our experimental results provide convincing evidence of 
the efficacy of our methods, in future work we aim to 
provide strong theoretical results supporting our experi- 
ences. 

An open question of interest is to characterize (other) 
useful classes of DBNs that have both practical rele- 
vance (like excitatory circuits) and which also can be 
tractably inferred using sufficient statistics of the form 
studied here. 

9 Repeatability 

Supplementary material, algorithm implementations, 
and results for this paper are hosted at http://neural- 
code.cs.vt.edu/dbn. 
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